Normal distribution & central limit theorem

Let's assume that the monthly growth rate follows the following distribution:

What is the distribution of heights of 10000 children at different ages?

This is an example that illustrates the central limit theorem, which states that the result of processes that manifest as the sum of many small, identically and independently distributed events is normally distributed.

One way to explain why this is the case is to see that there are more possible combinations of events that lead to average outcomes than combinations of events that lead to extreme outcomes.

For instance, assume that you are tossing a fair coin four times; each time heads comes up you receive one credit point, and each time tails comes up you lose a credit point. The next table shows that there are more possible sequences that lead to an end result of 0 credit points than sequences that lead to the extreme results of -4 or 4 credit points.

Permutation event 1 event 2 event 3 event 4 sum
1 -1 -1 -1 -1 -4
2 1 -1 -1 -1 -2
3 -1 1 -1 -1 -2
4 1 1 -1 -1 0
5 -1 -1 1 -1 -2
6 1 -1 1 -1 0
7 -1 1 1 -1 0
8 1 1 1 -1 2
9 -1 -1 -1 1 -2
10 1 -1 -1 1 0
11 -1 1 -1 1 0
12 1 1 -1 1 2
13 -1 -1 1 1 0
14 1 -1 1 1 2
15 -1 1 1 1 2
16 1 1 1 1 4
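The table above can also be generated programmatically; a minimal sketch using expand.grid:

```r
# All 2^4 = 16 sequences of four +1/-1 events and their sums
seqs = expand.grid(rep(list(c(-1, 1)), 4))
seqs$sum = rowSums(seqs[, 1:4])
table(seqs$sum)  # counts: -4:1, -2:4, 0:6, 2:4, 4:1
```

Six of the sixteen sequences sum to 0, but only one each sums to -4 or 4.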

Now let's do the same experiment again, except that we are looking not at 4 but at 16 tosses, which leads to \(2^{16} = 65{,}536\) possible sequences. Here is the distribution of credit points.
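A quick sketch of this distribution by simulation (drawing sequences rather than enumerating all 65,536 of them):

```r
# Simulate credit-point sums for 10000 sequences of 16 fair coin tosses
# (heads = +1, tails = -1); the distribution is approximately normal.
set.seed(1)
sums = replicate(10000, sum(sample(c(-1, 1), 16, replace = TRUE)))
hist(sums, breaks = seq(-17, 17, by = 2), main = "", xlab = "credit points")
```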

One popular device to display such a process is a Galton¹ board:

Linear regression model

What is the association between length and weight at birth?

We simulate some data:

dt = data.frame(length = rnorm(250,50,5))
expected_weight = 3.5 + scale(dt$length)*.5
dt$weight = rnorm(250,expected_weight,.5)

When data covary, we can look at, for example, a scatter plot, which shows the joint distribution, to see how the data are related.

What is marginalization?

Sometimes, we want information about only one dimension of the data. This information is shown in the marginal distribution.

In this plot, each histogram shows the marginal distribution of length and weight, respectively. For example, to get the frequency of length = 50 cm, we count all individuals with that length, regardless of their weight.
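A minimal sketch of the two marginal distributions, re-simulating the data from above so the snippet is self-contained (the seed is our addition; the original simulation did not set one):

```r
set.seed(1)  # assumed seed, for reproducibility only
dt = data.frame(length = rnorm(250, 50, 5))
dt$weight = rnorm(250, 3.5 + scale(dt$length) * .5, .5)

par(mfrow = c(1, 2))
hist(dt$length, main = "", xlab = "length [cm]")  # marginal of length
hist(dt$weight, main = "", xlab = "weight [kg]")  # marginal of weight
```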

Modeling the mean

Describing the model

To model such data, we typically think first about the likelihood function. The question here is which distribution best describes the data we observed. Given that we can think of birth weight as the result of a sum of many processes, a normal distribution, with parameters \(\mu\) and \(\sigma\) for mean and standard deviation, makes sense.

One easy way to spot parameters is that they are typically Greek letters, whereas data variables are Roman letters.


| What | Notation | quap R-code |
|---|---|---|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\) | weight ~ dnorm(mu, sigma) |


Parameters \(\mu\) and \(\sigma\) are variables that we cannot observe directly; we have to estimate them from the data and the prior. The table above already shows how they depend on the data, but we still need to add that they also depend on the prior:


| What | Notation | quap R-code |
|---|---|---|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\) | weight ~ dnorm(mu, sigma) |
| Prior | \(\mu \sim Normal(3.5,1.5)\) | mu ~ dnorm(3.5, 1.5) |
| Prior | \(\sigma \sim Uniform(0,1.5)\) | sigma ~ dunif(0, 1.5) |


And if we want to impress (or scare off) a colleague, we can write down the full model:

\[ \overset{Posterior}{P(\mu,\sigma|w)} = \frac{\prod_i \overset{Likelihood}{Normal(w_i|\mu,\sigma)} \cdot \overset{Prior}{Normal(\mu|3.5,1.5) \, Uniform(\sigma|0,1.5)}}{\overset{Evidence}{\int\int \prod_i Normal(w_i|\mu,\sigma) \cdot Normal(\mu|3.5,1.5) \, Uniform(\sigma|0,1.5) \, d\mu \, d\sigma}} \]

This looks scarier than it is. \(\prod_i f_i\) just means that we calculate the quantity behind the product sign \(\prod\) for each (individual) \(i\) and then calculate the product of all these quantities.

The numerator just means to calculate the following for each combination of \(\mu\) and \(\sigma\):

  1. for each participant \(i\), calculate the product of
  • the likelihood (\(Normal(w_i|\mu,\sigma)\)) and
  • the prior (\(Normal(\mu|3.5,1.5) \, Uniform(\sigma|0,1.5)\))
  2. calculate the product of all values generated in 1.

This is not how the calculation is really performed (except when one uses grid approximation). Instead, methods like Laplace approximation or MCMC are used to calculate this quantity.

The denominator is what is called the evidence, and its main purpose is to ensure that the posterior integrates to 1.
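To make the numerator and denominator concrete, here is a tiny grid-approximation sketch for \(\mu\) alone, with \(\sigma\) held fixed at 0.5 for simplicity (the toy data and the fixed \(\sigma\) are our assumptions, purely for illustration):

```r
set.seed(1)
w = rnorm(20, 3.5, 0.5)              # some toy observed weights
mu.grid = seq(2, 5, length.out = 200)
sigma = 0.5                          # held fixed for this illustration

# numerator: likelihood product times prior, for each grid value of mu
numerator = sapply(mu.grid, function(mu)
  prod(dnorm(w, mu, sigma)) * dnorm(mu, 3.5, 1.5))

# denominator ("evidence"): sum over the grid, so the posterior sums to 1
posterior = numerator / sum(numerator)

plot(mu.grid, posterior, type = "l", xlab = expression(mu), ylab = "posterior")
```

On a grid, the double integral becomes a simple sum; this only works because the problem here is low-dimensional.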

Prior predictive check

It is good practice to check if the model with its priors makes sensible predictions before one estimates the model parameters. The goal is not to prior-predict data that are very similar to the observed data, but to prior-predict data that pass plausibility checks and are in the same ballpark as the observed data. Plausibility checks are simple things, such as that there are no negative weights. By “in the same ballpark” one means that the values should have approximately the same order of magnitude. For instance, newborns weigh a few kilograms, not tens or hundreds of kilograms. Prior predictive checks rely on domain knowledge, and can be very useful in understanding the effect of multivariate priors. We still give it a quick try for our simple example:

my_blue = adjustcolor("blue",alpha = .5) # for model predictions
my_gray = adjustcolor("black",alpha = .5) # for data

prior.predictive.weights = 
  rnorm(
    10000,
    rnorm(10000, 3.5, 1.5),  # draw mu from its prior
    runif(10000, 0, 1.5))    # draw sigma from its prior
hist(prior.predictive.weights,
     main = "", xlab = "prior predictive weights",
     col = my_blue)
text(-2,2000,expression(mu~"="~Normal(3.5,1.5)), pos = 4)
text(-2,1850,expression(sigma~"="~Uniform(0,1.5)), pos = 4)

This looks reasonable, even though we surely know that there are no newborns with a weight below 0 or above 8 kilograms. But the point of the prior is not to forbid impossible values. Such impossible values should just be very rare, and if they have to be allowed in order to ensure a weak influence of the prior on the results, this is OK.

Note that our choice of priors is not influenced by the data we have, but by the domain knowledge we have before seeing the data.

How does it look if we use non-informative priors?
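A sketch of the same simulation with very wide, "non-informative" priors; the specific values \(\mu \sim Normal(0, 100)\) and \(\sigma \sim Uniform(0, 50)\) are our assumption for illustration:

```r
set.seed(1)
flat.prior.predictive.weights =
  rnorm(10000,
        rnorm(10000, 0, 100),   # very wide prior for mu
        runif(10000, 0, 50))    # very wide prior for sigma
hist(flat.prior.predictive.weights,
     main = "", xlab = "prior predictive weights (flat priors)")
```

Almost all of these "birth weights" fall far outside any plausible range, including strongly negative values.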

It seems obvious that the first set of parameters makes more sense.

Estimating the model parameters

To analyse the model, we can just use the data.frame we created above:

head(dt)
##     length   weight
## 1 47.01580 3.122275
## 2 47.34526 2.647748
## 3 47.59354 2.251503
## 4 48.98099 3.534602
## 5 55.62158 4.148292
## 6 52.73807 4.428840

Based on the quap code in the table above, we can also set up the quap model. alist is a command that creates a list that holds our model.

alpha.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- alpha,
    alpha  ~ dnorm(3.5,1.5),
    sigma  ~ dunif(0,1.5)
  )

For reasons that will become clear soon, we are using a parameter \(\alpha\) as a determinant of the mean \(\mu\).

Now we can use quap to calculate the posterior distribution.

alpha.model.fit = quap(alpha.model.list, data=dt)

And we can have a first glimpse of the results with the precis function from the rethinking package:

precis(alpha.model.fit)
##            mean         sd     5.5%     94.5%
## alpha 3.4988063 0.04528852 3.426427 3.5711861
## sigma 0.7164009 0.03203484 0.665203 0.7675988

Posterior predictive check

Before we start interpreting our modeling results, it is always a good idea to see if the model is any good at describing the data.

To do this, we first extract samples from the posterior of the fitted model:

alpha.model.post = extract.samples(alpha.model.fit,n=1e4)
# calculate mu, because quap does not return it automatically
alpha.model.post$mu = alpha.model.post$alpha
head(alpha.model.post)
##      alpha     sigma       mu
## 1 3.552243 0.7436645 3.552243
## 2 3.572552 0.7155162 3.572552
## 3 3.503263 0.7512598 3.503263
## 4 3.457438 0.6800497 3.457438
## 5 3.630093 0.7303805 3.630093
## 6 3.499774 0.7017660 3.499774

One thing we would like to check is if the observed mean is within the credible interval of the posterior for \(\mu\):

hist(alpha.model.post$mu, main = "",
     xlab = expression(mu), col = my_blue )
abline(v = HPDI(alpha.model.post$mu), col = "blue")
abline(v = mean(dt$weight), lwd = 2)

We also expect that most of the data lies within the posterior predicted values. First we calculate the posterior predictions, which depend on \(\mu\) and \(\sigma\):

posterior.predictions = 
  rnorm(nrow(alpha.model.post),
        alpha.model.post$mu,
        alpha.model.post$sigma)

hist(dt$weight, col = my_gray, main = "", xlab = "weight")
hpdi = HPDI(posterior.predictions)
abline(v = hpdi, col = "blue")
in.hpdi = mean(dt$weight > hpdi[1] & dt$weight < hpdi[2])

title(paste0(round(in.hpdi*100),"% of weights are in the 89% HPDI"))

Now let’s simulate 200 weights and see if they are associated with length:
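A sketch of this simulation. In the real analysis we would use alpha.model.post; to keep the snippet self-contained we mimic the posterior with stand-in draws matching the precis output above (these stand-in values are our assumption):

```r
set.seed(1)
dt = data.frame(length = rnorm(250, 50, 5))
dt$weight = rnorm(250, 3.5 + scale(dt$length) * .5, .5)

# stand-in posterior draws; really: extract.samples(alpha.model.fit)
post = data.frame(mu = rnorm(200, 3.5, .045), sigma = rnorm(200, .72, .032))

idx = sample(nrow(dt), 200)
predicted.weight = rnorm(200, post$mu, post$sigma)

par(mfrow = c(1, 2))
plot(dt$length[idx], dt$weight[idx], main = "observed",
     xlab = "length", ylab = "weight")
plot(dt$length[idx], predicted.weight, main = "predicted",
     xlab = "length", ylab = "weight")
```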

Unlike in the observed data, the predicted data do not show an association between length and weight. This is obviously because we did not use length to predict weight.

Linear regression

Instead of the previous model, where each individual's weight depends only on the group mean, we want a model in which individual weights also depend on birth length.

Looking at the right-hand side of the plot, we see that it is not so easy to think about what the mean weight is, because length has an awkward scale. We can simply re-scale it.

dt$length.s = scale(dt$length)
plot(dt$length.s,
     dt$weight,
     ylab = "weight",
     xlab = "length",
     ylim = range(dt$weight))

Now it is easier to see that when length = 0 (the average), the weight should be around 3.5 kg, and that when length goes up by 5 units (-3 to 2), weight goes up by 4 units (2 to 6), i.e. about 0.8 kg per unit of scaled length.

The model

Next we “build” the new model, which can be visualized as follows:

Here is the full model


| What | Notation | quap R-code |
|---|---|---|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\) | weight ~ dnorm(mu, sigma) |
| Linear model | \(\mu_i = \alpha + \beta l_i\) | mu[i] <- a + b * length.s[i] |
| Prior | \(\alpha \sim Normal(3.5,1.5)\) | alpha ~ dnorm(3.5, 1.5) |
| Prior | \(\beta \sim Normal(1,1)\) | beta ~ dnorm(1, 1) |
| Prior | \(\sigma \sim Uniform(0,1.5)\) | sigma ~ dunif(0, 1.5) |


The only real difference from the previous model is in rows 2 and 4: mu is now also a function of length, and we added a prior for the effect of length.

Prior predictive check

We again do a prior predictive check to see if the model and priors are broadly in line with the data.

We are using the abline function for this; see its documentation for details.

plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-3,3))
clr = adjustcolor("black",alpha = .25)
for (k in 1:50)
  abline(rnorm(1,3.5,1.5), rnorm(1,1,1), col = clr)

This looks pretty wild. It is implausible that we have a negative association between length and weight, and the mean weight covers an implausibly large range. We can do better than that, without letting the priors determine the fitting results.

A good rule to keep in mind is that for a prior \(Normal(\mu,\sigma)\), 95% of the prior mass lies between \(\mu - 2\sigma\) and \(\mu + 2\sigma\).
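A one-line check of this rule in R:

```r
# probability mass of a Normal within mu +/- 2*sigma
mass = pnorm(2) - pnorm(-2)
round(mass, 3)  # approximately 0.954
```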

plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-3,3))
for (k in 1:50)
  abline(rnorm(1,3.5,1), rlnorm(1,-.25,.5), col = clr)

This looks much more reasonable. Here is the model with new priors:

| What | Notation | quap R-code |
|---|---|---|
| Likelihood | \(w_i \sim Normal(\mu,\sigma)\) | weight ~ dnorm(mu, sigma) |
| Linear model | \(\mu_i = \alpha + \beta l_i\) | mu[i] <- a + b * length.s[i] |
| Prior | \(\alpha \sim Normal(3.5, 1)\) | alpha ~ dnorm(3.5, 1) |
| Prior | \(\beta \sim logNormal(-.25, .5)\) | beta ~ dlnorm(-.25, .5) |
| Prior | \(\sigma \sim Uniform(0,1.5)\) | sigma ~ dunif(0, 1.5) |

And here is the quap model:

mu.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- a + b * length.s,
    a  ~ dnorm(3.5,1),
    b  ~ dlnorm(-.25,.5),
    sigma  ~ dunif(0,1.5)
  )

Estimating the model parameters

We fit the quap model.

mu.model.fit = quap(mu.model.list, data=dt)
precis(mu.model.fit)
##            mean         sd      5.5%     94.5%
## a     3.4988165 0.03387102 3.4446841 3.5529490
## b     0.4787981 0.03354433 0.4251878 0.5324084
## sigma 0.5358553 0.02396413 0.4975560 0.5741546

Posterior predictive check

As should by now be customary, we check if our model describes the data reasonably well. We first extract the posterior samples.

mu.model.post = extract.samples(mu.model.fit)

Next we calculate the “transformed parameter” \(\mu\) for each participant. Because we have 10000 posterior samples and 250 participants, we end up with a 10000 × 250 matrix of posterior predictions.

n_samples = nrow(mu.model.post)

get_mu.model.pp = function(x, posterior, type = "pred"){
  x = matrix(x, ncol = 1)
  apply(
    x,  # instead of a for loop
    1,  # for each row (participant)
    function(length.s){
      with(data.frame(posterior),
           {
             mu = a + b * length.s              # calculate mu
             if (type == "lin_pred") {
               return(mu)
             } else {
               pp = rnorm(n_samples, mu, sigma) # generate post. preds
               return(pp)
             }
           }
      )
    }
  )
}

mu.model.pp = get_mu.model.pp(dt$length.s,mu.model.post)

dim(mu.model.pp)
## [1] 10000   250

And we check if the predicted weights are consistent with the observed means.

Posterior predictive checks do not need to be limited to simply the outcome variable. We can check any aspect of the data we care about. Here we check if the observed mean and sd of the data lie within the credible intervals of the mean and sd of the posterior predictions.

par(mfrow = c(1,2))

pp.mu = apply(mu.model.pp, 1, mean)
pp.sd = apply(mu.model.pp, 1 , sd)

hist(pp.mu, main = "", col = my_blue)
abline(v = c(HPDI(pp.mu),mean(dt$weight)), 
       col = c("blue","blue","black"))

hist(pp.sd, main = "", col = my_blue)
abline(v = c(HPDI(pp.sd),sd(dt$weight)), 
       col = c("blue","blue","black"))

We can also check if we still see the covariation of birth length and birth weight when we use predicted weights.
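A sketch of this check. In the real analysis we would take a row of mu.model.pp; to keep the snippet self-contained we plug in stand-in posterior means close to the precis output (our assumption):

```r
set.seed(1)
dt = data.frame(length = rnorm(250, 50, 5))
dt$length.s = scale(dt$length)
dt$weight = rnorm(250, 3.5 + dt$length.s * .5, .5)

# stand-in posterior means; really: one row of mu.model.post
a = 3.5; b = 0.48; sigma = 0.54
predicted.weight = rnorm(250, a + b * dt$length.s, sigma)

plot(dt$length.s, dt$weight, col = "grey",
     xlab = "scaled length", ylab = "weight")
points(dt$length.s, predicted.weight, pch = 2)
```

Unlike under the intercept-only model, the predicted weights now covary with length, as in the observed data.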

Experience helps in doing good posterior predictive checks. One guiding principle is that important characteristics of the data should still be visible in the posterior predictions.

Describe the results

Now that we have convinced ourselves that the model describes the data well enough (yes, this is to a degree subjective), we can present our results.

As a visual presentation we can e.g. just show the slope:

plot(dt$length.s,
     dt$weight,
     col = "grey", 
     ylab = "weight",
     xlab = "length",
     xaxt = "n")
x.ticks = seq(35,60,by = 5)
axis(1, 
     at = (x.ticks-mean(dt$length))/sd(dt$length), 
     labels = x.ticks)
for(k in 1:25) 
  abline(a = mu.model.post$a[k], 
         b = mu.model.post$b[k],
         col = adjustcolor("black",alpha = .2))

I like spaghetti plots, but confidence bands are more typically used:
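A sketch of such a band. It uses percentile intervals of \(\mu\) across a grid of lengths; to keep the snippet self-contained we again use stand-in posterior samples matching the precis output (our assumption; really one would use mu.model.post):

```r
set.seed(1)
x = seq(-3, 3, length.out = 61)

# stand-in posterior samples; really: mu.model.post
post = data.frame(a = rnorm(1000, 3.5, .034), b = rnorm(1000, .48, .034))

# mu for each posterior sample and each x value: a 1000 x 61 matrix
mu = sapply(x, function(xi) post$a + post$b * xi)

mu.mean  = apply(mu, 2, mean)
mu.lower = apply(mu, 2, quantile, .055)  # 89% percentile interval
mu.upper = apply(mu, 2, quantile, .945)

plot(x, mu.mean, type = "l", xlab = "scaled length", ylab = "weight")
polygon(c(x, rev(x)), c(mu.lower, rev(mu.upper)),
        col = adjustcolor("blue", alpha = .3), border = NA)
```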

This is how one could describe the results:

We found that a one standard deviation increase in birth length was associated with a 479 gram (89% credible interval: 425, 533) increase in birth weight.

Splines

Splines are an alternative to polynomial regression if one wants to model non-linear relationships.

In polynomial regression, the regression weight for each basis function changes the predicted values over the entire range of the predictor variable.

x = seq(-3,3,.1)
X.poly = poly(x, 2, raw = TRUE)

b1 = c(1,1)
b2 = c(1,2)

y1 = X.poly %*% b1
y2 = X.poly %*% b2

par(mfrow = c(2,1)) 
matplot(X.poly, type = 'l', ylab = "basis function value", xlab = "")
title("Polynomials")
plot(x, y1, 'l', lwd = 2,
     ylim = range(c(y1,y2)),
     col = "blue",
     ylab = "y (weighted sum of basis functions)")
lines(x,y2, col = "red", lwd = 2, lty = 2)

As a result, it is not possible to fit local trends.

Splines partition the x-axis into overlapping regions, and the parameters for the basis functions modify the predicted values only in specific regions.

library(splines)
knot.locations = seq(-3,3, length.out = 5)
X.splines = bs(x, knots = knot.locations)

b1 = scale(1:ncol(X.splines))^2
b2 = b1
b2[(length(b2)-2):length(b2)] = 0

y1 = X.splines %*% b1
y2 = X.splines %*% b2

par(mfrow = c(2,1))
matplot(X.splines, type = 'l', ylab = "basis function value", xlab = "")
title("Spline basis functions")
plot(x, y1, 'l', lwd = 2,
     ylim = range(c(y1,y2)),
     col = "blue",
     ylab = "y (weighted sum of basis functions)")
lines(x,y2, col = "red", lty = 2, lwd = 2)

As a result, splines can model local variations.


  1. Who certainly was clever, but is nowadays also infamous for his views on eugenics and race.↩︎